Learning to Identify Physiological and Adventitious Metal-Binding Sites in the Three-Dimensional Structures of Proteins by Following the Hints of a Deep Neural Network

Thirty-eight percent of protein structures in the Protein Data Bank contain at least one metal ion. However, not all these metal sites are biologically relevant. Cations present as impurities during sample preparation or in the crystallization buffer can cause the formation of protein–metal complexes that do not exist in vivo. We implemented a deep learning approach to build a classifier able to distinguish between physiological and adventitious zinc-binding sites in the 3D structures of metalloproteins. We trained the classifier using manually annotated sites extracted from the MetalPDB database. Using a 10-fold cross-validation procedure, the classifier achieved an accuracy of about 90%. The same neural classifier could predict the physiological relevance of non-heme mononuclear iron sites with an accuracy of nearly 80%, suggesting that the rules learned on zinc sites have general relevance. By quantifying the relative importance of the features describing the input zinc sites from the network perspective and by analyzing the characteristics of the MetalPDB datasets, we inferred some common principles. Physiological sites present a low solvent accessibility of the amino acids forming coordination bonds with the metal ion (the metal ligands), a relatively large number of residues in the metal environment (≥20), and a distinct pattern of conservation of Cys and His residues in the site. Adventitious sites, on the other hand, tend to have a low number of donor atoms from the polypeptide chain (often one or two). These observations support the evaluation of the physiological relevance of novel metal-binding sites in protein structures.


Figures
Supporting Figure S1. Construction of MBSs. For each metal atom in any selected 3D structure, the non-hydrogen atoms at a distance smaller than 3.0 Å from the metal ion (blue sphere) are identified as its donor atoms (red atoms), i.e. the atoms that bind directly to the metal. The protein residues or small molecules that contain at least one donor atom are the metal ligands (cyan sticks), and constitute the first coordination sphere of the metal ion. The full MBS is obtained by including any other residue or chemical species having at least one atom within 5.0 Å from a metal ligand (pink sticks).

Figure S3. Examples of successful predictions by the neural classifier. A) The physiological zinc(II) site of structure 3OXK [1]; B) one of the two physiological interfacial zinc(II) sites of structure 1U2W [2]; C) one of the adventitious zinc(II) sites of structure 5CHU [3]; D) the iron(III) site of structure 1VXV [4]. In all panels, the metal ion of interest is shown as a wheat (zinc) or orange (iron) sphere; the metal ligands are shown as blue sticks; the backbone of the MBS is colored in light blue. The protein backbone is displayed using a cartoon representation. In panels B-C, other zinc(II) ions present in the structure are shown as grey spheres. In panel D, the small blue sphere is a water molecule coordinating the iron ion. In panel B, the two protein chains are colored in green and red, respectively (except for the MBS).
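The MBS construction described in the caption of Figure S1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the atom layout `(residue_id, element, (x, y, z))`, the function name `build_mbs`, and the choice to measure the 5.0 Å shell from every atom of a ligand are assumptions made here for clarity.

```python
import math

DONOR_CUTOFF = 3.0  # Å, metal-to-donor distance (from the Figure S1 caption)
SHELL_CUTOFF = 5.0  # Å, ligand-to-neighbour distance (from the Figure S1 caption)


def build_mbs(metal_xyz, atoms):
    """Return (ligands, shell) as sets of residue identifiers.

    atoms: list of (residue_id, element, (x, y, z)) tuples; this layout is
    hypothetical and only serves the example.
    """
    # Donor atoms: non-hydrogen atoms closer than 3.0 Å to the metal ion.
    donors = [a for a in atoms
              if a[1] != "H" and math.dist(a[2], metal_xyz) < DONOR_CUTOFF]
    # Metal ligands: residues or molecules owning at least one donor atom.
    ligands = {a[0] for a in donors}
    ligand_atoms = [a for a in atoms if a[0] in ligands]
    # Rest of the MBS: any other residue with at least one atom within
    # 5.0 Å of an atom of a metal ligand (assumed reading of the caption).
    shell = set()
    for res_id, _, xyz in atoms:
        if res_id in ligands:
            continue
        if any(math.dist(xyz, la[2]) <= SHELL_CUTOFF for la in ligand_atoms):
            shell.add(res_id)
    return ligands, shell


# Toy coordinates (invented for the example), metal at the origin.
example_atoms = [
    ("HIS10", "N", (2.1, 0.0, 0.0)),
    ("HIS10", "C", (3.2, 0.5, 0.0)),
    ("CYS25", "S", (0.0, 2.3, 0.0)),
    ("GLY11", "C", (6.0, 0.5, 0.0)),
    ("ALA90", "C", (20.0, 20.0, 20.0)),
]
ligands, shell = build_mbs((0.0, 0.0, 0.0), example_atoms)
```

In this toy input, HIS10 and CYS25 are the metal ligands, GLY11 joins the site through its proximity to HIS10, and ALA90 is left out of the MBS.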

Tables
Supporting Table S1. List of the 29 features representing each residue of the input sequence. The binding role features define the function of the residue within the MBS (2 for metal ligands; 1 for all other MBS residues; 0 for all other residues in the protein).
MSA scores (20 features): Ala, Cys, Asp, Glu, Phe, Gly, His, Ile, Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr, Val, Trp, Tyr
Solvent accessibility (2 features): absolute, relative
Binding role (3 features, true/false): binding role 0, binding role 1, binding role 2
Secondary structure (4 features, true/false): helix, sheet, turn, other

Supplementary Table S2. Classification of a set of zinc sites extracted from PDB structures released in 2022. P() approximates the real probability distribution; because of the approximation, P(Adventitious)+P(Physiological) may differ slightly from 1.000.

This section describes the architectural details of the neural network implemented in this work, listing all the information needed to build the MBS classifier. The network is composed of three modules: the convolutional, the recurrent and the classification module, as shown in Figure S3. The model is fed with a sequence, representing a given metalloprotein structure, which is encoded as an array of size d
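As an illustration of the per-residue encoding listed in Table S1, the sketch below assembles the 29 features into a flat vector. The function name `encode_residue`, its argument names, and the representation of true/false features as 1.0/0.0 are assumptions for the example; the feature order follows the order in which Table S1 lists them.

```python
AMINO_ACIDS = ["Ala", "Cys", "Asp", "Glu", "Phe", "Gly", "His", "Ile", "Lys",
               "Leu", "Met", "Asn", "Pro", "Gln", "Arg", "Ser", "Thr", "Val",
               "Trp", "Tyr"]
SS_CLASSES = ["helix", "sheet", "turn", "other"]


def encode_residue(msa_scores, abs_sasa, rel_sasa, binding_role, sec_struct):
    """Build the 29-feature vector of one residue (hypothetical helper).

    msa_scores: dict mapping each of the 20 amino acids to its MSA score.
    binding_role: 2 for metal ligands, 1 for other MBS residues, 0 otherwise.
    sec_struct: one of "helix", "sheet", "turn", "other".
    """
    msa = [float(msa_scores[aa]) for aa in AMINO_ACIDS]            # 20 features
    sasa = [float(abs_sasa), float(rel_sasa)]                      # 2 features
    role = [1.0 if binding_role == r else 0.0 for r in (0, 1, 2)]  # 3 features
    ss = [1.0 if sec_struct == s else 0.0 for s in SS_CLASSES]     # 4 features
    return msa + sasa + role + ss


# Invented values for a buried, helical metal ligand.
uniform_msa = {aa: 0.05 for aa in AMINO_ACIDS}
vector = encode_residue(uniform_msa, 12.3, 0.08, binding_role=2,
                        sec_struct="helix")
```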

Convolutional module
This module is implemented with two blocks as shown in Fig. S4. Each block is composed of a one-

Recurrent Module
This module is implemented with a recurrent neural network (RNN) having Gated Recurrent Units (GRU) (Figure S5). The network has 3 layers, each with 15 neurons, ReLU activation functions and a dropout of 0.6. The input sequence is obtained from the convolutional module; therefore, its shape is known. The RNN input size is constrained to be equal to the number of output channels in the second conv1D block.

Figure S5. Architecture of the recurrent module
An RNN fed with a sequence generates a sequence as output. Here, however, only the last element of the RNN-generated sequence is passed to the subsequent module. Therefore, the output array has size 15.

Fully Connected Module
Once the array/embedding representing the site is generated by the RNN, it is fed into a fully connected layer (also known as a linear layer) with a number of input units equal to the number of output units of the RNN (an architectural constraint), that is 15, and 2 output neurons. These 2 neurons output the class-probability distribution for the site given as input to the model.
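The hand-off from the recurrent module to the fully connected layer can be sketched as follows, in plain Python and without any deep learning library. Only the sizes (15 inputs, 2 outputs) and the use of the last sequence element come from the text; the function names, the random weights and the stand-in RNN output are placeholders, and no output normalization is applied because the text does not specify one.

```python
import random

HIDDEN = 15   # size of each RNN output element (from the text)
CLASSES = 2   # adventitious and physiological


def linear(x, weight, bias):
    """Fully connected layer for one input vector: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def classify_site(rnn_outputs, weight, bias):
    """Keep only the last element of the RNN output sequence and classify it."""
    embedding = rnn_outputs[-1]
    if len(embedding) != HIDDEN:
        raise ValueError("the linear layer expects a 15-dimensional embedding")
    return linear(embedding, weight, bias)


# Placeholder weights and a stand-in RNN output sequence of length 7.
rng = random.Random(0)
weight = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(CLASSES)]
bias = [0.0] * CLASSES
rnn_outputs = [[rng.random() for _ in range(HIDDEN)] for _ in range(7)]
scores = classify_site(rnn_outputs, weight, bias)  # two class scores
```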

Training process
The zinc(II) dataset was randomly split into 10 stratified folds; accordingly, 10 different models were trained, each with a different train-validation-test set, following a classical cross-validation procedure.
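A stratified 10-fold split of this kind can be sketched with the standard library alone. The text does not state how the validation set of each model is chosen; using the fold that follows the test fold is an assumption made here purely for illustration, as are the function names and the toy class counts.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def cv_splits(folds):
    """Yield (train, validation, test) index lists, one triple per fold.

    Assumption: fold i is the test set and fold i+1 the validation set.
    """
    k = len(folds)
    for i in range(k):
        j = (i + 1) % k
        train = [idx for n, fold in enumerate(folds) if n not in (i, j)
                 for idx in fold]
        yield train, folds[j], folds[i]


# Toy labels (invented): 30 adventitious (0) and 70 physiological (1) sites.
labels = [0] * 30 + [1] * 70
folds = stratified_folds(labels)
splits = list(cv_splits(folds))
```

Each of the 10 triples would then be used to train, select and test one model.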
Architecture parameters (such as the number of layers, dropout values, etc.) were adjusted during the model search process, which was performed on a random subsample of the training data of a randomly chosen fold (to reduce computational time) and tested on the associated validation data. Training was performed using the Adam optimization algorithm (a variant of classical stochastic gradient descent) with a learning rate of 0.001, processing batches of 16 sequences at a time and shuffling the data after each epoch. The training step was repeated for 200 epochs (Figure S7), and the model configuration (weight values) with the best performance (accuracy and loss value) on the validation set was kept. In the calculation of the loss value, different weights were given to data points of the different classes, so that the loss was not biased toward the class having more points. The cost function used was the mean squared error (MSE).
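A class-weighted MSE loss can be written compactly as shown below. The text states only that the classes receive different weights; the inverse-frequency weighting, the one-hot targets and the function names used here are assumptions for the sake of a concrete example.

```python
def class_weights(labels, n_classes=2):
    """Inverse-frequency weights (assumed scheme): rarer classes weigh more."""
    counts = [labels.count(c) for c in range(n_classes)]
    return [len(labels) / (n_classes * n) for n in counts]


def weighted_mse(outputs, labels, weights):
    """Mean over a batch of the per-sample MSE, scaled by the class weight.

    outputs: list of per-class score lists; labels: list of class indices.
    Targets are one-hot vectors (assumption).
    """
    total = 0.0
    for out, label in zip(outputs, labels):
        target = [1.0 if c == label else 0.0 for c in range(len(out))]
        sq_err = sum((o - t) ** 2 for o, t in zip(out, target)) / len(out)
        total += weights[label] * sq_err
    return total / len(labels)


# Toy batch (invented): one site of class 0 and three of class 1.
batch_labels = [0, 1, 1, 1]
batch_outputs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9], [0.4, 0.6]]
w = class_weights(batch_labels)
loss = weighted_mse(batch_outputs, batch_labels, w)
```

With this weighting, an error on the single class-0 site counts three times as much as the same error on a class-1 site, which compensates for the imbalance in the batch.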